Battle royale games have surged in popularity in recent years. The premise of such games is as follows: players are dropped onto a fictional island and fight to be the last person standing. As they roam around the island, they loot for weapons and items crucial for their survival. Players can choose to join a game as a solo player or with a group of friends (4 players maximum). When playing solo, players are immediately eliminated when they are killed. However, in group play, killed individuals can be revived by their teammates.
We are interested in building a prediction model for the popular battle royale game PUBG (PlayerUnknown’s Battlegrounds). In PUBG, players not only have to worry about getting killed by other players, but they also have to stay within the shrinking “safe zone,” which effectively forces players into contact with each other. Outside of the “safe zone,” players take damage to their health at increasing rates.
Through our analysis, we aim to understand which playing strategies are more successful than others: How aggressive are the playing styles of the winners? Is it better to land in a densely or sparsely populated area? Do players who travel farther on the map tend to place higher or lower? Answers to such questions will be of high interest for the PUBG gaming community.
First, we want to investigate how well we can predict a player’s placement based on their in-game actions. What actions or statistics are most predictive of their placement? Exploring this question can then provide insight into how different playing styles compare. We would like to be able to build a model that accurately predicts a player’s game performance, but also allows us to draw inferences about whether certain playing styles are more successful.
The data comes from the Kaggle competition. To download the data, join the Kaggle competition and run the shell script download_data.sh.
Note: We will need to provide a direct download link for the TA.
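# Packages used throughout this analysis
library(tidyverse)  # readr, dplyr, tidyr, ggplot2
library(janitor)    # clean_names()
library(caret)      # createDataPartition()
library(corrplot)   # corrplot()
# Download and unzip the data only if the archive is not already present
data.url <- "https://www.dropbox.com/s/319vkfevkfb6kqt/all.zip?dl=1"
if (!file.exists("./data/pubg.zip")) {
  dir.create("./data", showWarnings = FALSE)
  download.file(data.url, destfile = "./data/pubg.zip", mode = "wb")
  unzip("./data/pubg.zip", exdir = "./data/pubg")
}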
# Warning: Very large datasets. Read 10000 random samples before scaling up.
raw_dat <- read_csv("data/pubg/train_V2.csv") %>% sample_n(10000)
test_dat <- read_csv("data/pubg/test_V2.csv")
Each row in the data contains one player’s post-game stats. A description of all data fields is provided in pubg_codebook.csv. We will focus on the solo game mode (match_type is solo, solo-fpp, or normal-solo-fpp). The solo game mode constitutes about 15% of the data. The outcome variable we are trying to predict is win_place_perc.
# Clean names
# Select single-player data only
# Remove features that are not relevant to single players
# Change `kill_points` and `win_points` to NA if rank_points = -1 and kill/win_points = 0
# Change rank_points to NA if rank_points = -1 (done last, since the two rules above depend on it)
# Change id and match_id to factors
# Drop rows missing data for win_place_perc
clean_dat <- raw_dat %>%
  clean_names() %>%
  filter(match_type %in% c("solo", "solo-fpp", "normal-solo-fpp")) %>%
  select(-dbn_os, -assists, -revives, -group_id, -match_type, -team_kills) %>%
  mutate(kill_points = ifelse(rank_points == -1 & kill_points == 0, NA, kill_points),
         win_points = ifelse(rank_points == -1 & win_points == 0, NA, win_points),
         rank_points = ifelse(rank_points == -1, NA, rank_points)) %>%
  mutate(id = as.factor(id), match_id = as.factor(match_id)) %>%
  drop_na(win_place_perc)
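To make the sentinel handling concrete, here is a minimal sketch on a made-up three-row data frame (the values are hypothetical, not taken from the PUBG data). It applies the same rule as the pipeline above in base R: a rank_points of -1 is recoded to NA, and a zero in kill_points or win_points is recoded to NA only when rank_points is also -1.

```r
# Hypothetical toy rows; values are made up for illustration.
toy <- data.frame(
  rank_points = c(-1, -1, 1500),
  kill_points = c(0, 1200, 0),
  win_points  = c(0, 1400, 0)
)

# Compute the flag before overwriting rank_points, since the other
# two rules depend on the original -1 sentinel.
no_rank <- toy$rank_points == -1
toy$kill_points[no_rank & toy$kill_points == 0] <- NA
toy$win_points[no_rank & toy$win_points == 0] <- NA
toy$rank_points[no_rank] <- NA

print(toy)
```

The order matters in the dplyr version as well: mutate() evaluates its arguments in sequence, so rank_points has to be recoded after kill_points and win_points, as the original code does.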
We are given a training set and a test set. The outcome variable for the test set will not be released until the Kaggle competition ends on January 30, 2019. Therefore, for the purposes of this project, we use only the provided training set, from which we create our own training and test sets.
# Split into train (80%) and test (20%) sets
train_ind = createDataPartition(y = clean_dat$win_place_perc, p = 0.8, list = FALSE)
train = clean_dat %>%
  slice(train_ind)
test = clean_dat %>%
  slice(-train_ind)
head(train)
# A tibble: 6 x 23
id match_id boosts damage_dealt headshot_kills heals kill_place
<fct> <fct> <int> <dbl> <int> <int> <int>
1 315c… 6dc8ff8… 0 100 0 0 45
2 311b… 2926117… 0 8.54 0 0 48
3 b780… 2c30ddf… 1 324. 1 5 5
4 9202… 07948d7… 3 254. 0 12 13
5 4714… bc2faec… 0 137. 0 0 37
6 0ba4… f7cb761… 0 194. 1 1 19
# ... with 16 more variables: kill_points <int>, kills <int>,
# kill_streaks <int>, longest_kill <dbl>, match_duration <int>,
# max_place <int>, num_groups <int>, rank_points <int>,
# ride_distance <dbl>, road_kills <int>, swim_distance <dbl>,
# vehicle_destroys <int>, walk_distance <dbl>, weapons_acquired <int>,
# win_points <int>, win_place_perc <dbl>
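For a numeric outcome, createDataPartition() samples within groups defined by percentiles of the outcome, so that the train and test sets have similar outcome distributions. The following is a rough base-R approximation of that idea on simulated data, not caret's actual implementation. The seed, the toy data, and the use of quintiles are all assumptions made for illustration.

```r
set.seed(42)  # hypothetical seed, only to make the sketch reproducible

# Simulated outcome on the same 0-1 scale as win_place_perc
toy <- data.frame(id = 1:100, win_place_perc = runif(100))

# Bin the outcome into quintiles
breaks <- quantile(toy$win_place_perc, probs = seq(0, 1, by = 0.2))
toy$bin <- cut(toy$win_place_perc, breaks = breaks, include.lowest = TRUE)

# Sample 80% of the row indices within each bin
rows_by_bin <- split(seq_len(nrow(toy)), toy$bin)
train_idx <- unlist(lapply(rows_by_bin, function(rows) {
  rows[sample.int(length(rows), size = round(0.8 * length(rows)))]
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Both sets should cover a similar range of the outcome
summary(toy_train$win_place_perc)
summary(toy_test$win_place_perc)
```

Calling set.seed() before the real split would likewise make the partition reproducible across runs, although the row counts and output shown in this report come from the run as it was originally executed.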
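In our training set, we have 1286 players with no duplicates (i.e., no player appears in more than one game). These players come from 1181 matches. Because we read only a 10,000-row sample, each match is sparsely covered: using max_place as an estimate of the number of players, the matches listed below have data for only about 1% of their players.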
# Look at unique players
length(train$id)
[1] 1286
n_distinct(train$id)
[1] 1286
# Look at unique matches
n_distinct(train$match_id)
[1] 1181
# Look at players per game
# Calculate proportion of data we have for individuals in a game,
# using max place as our estimate for number of players
train %>% group_by(match_id, max_place, match_duration) %>% count() %>%
mutate(prop_data = n/max_place)
# A tibble: 1,181 x 5
# Groups: match_id, max_place, match_duration [1,181]
match_id max_place match_duration n prop_data
<fct> <int> <int> <int> <dbl>
1 00086e740a5804 98 1962 1 0.0102
2 0022adebf59be6 96 1381 1 0.0104
3 007e905294c254 96 1482 1 0.0104
4 00817c4617e086 99 1781 1 0.0101
5 00a4afdb911815 89 1356 1 0.0112
6 00c6ce68aebd56 96 1356 1 0.0104
7 00d76f8d3229dc 95 1453 1 0.0105
8 00efee3dd33d5f 98 1354 1 0.0102
9 01435d33376fd0 15 1808 1 0.0667
10 0174d36366246a 99 1353 1 0.0101
# ... with 1,171 more rows
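The printed tibble shows only the first ten matches. To summarise coverage across all matches, one option is to average prop_data. Here is a minimal base-R sketch on a made-up data frame; the match ids and max_place values are hypothetical, and max_place is again used as a stand-in for the number of players.

```r
# Hypothetical toy data: one row per observed player
toy <- data.frame(
  match_id  = c("m1", "m1", "m2", "m3"),
  max_place = c(4, 4, 10, 5)
)

# Rows observed per match and the estimated number of players per match.
# table() and tapply() both order their results by the sorted match ids.
rows_per_match <- table(toy$match_id)
max_place <- tapply(toy$max_place, toy$match_id, max)

# Proportion of each match's players that appear in the data
prop_data <- as.numeric(rows_per_match) / as.numeric(max_place)
names(prop_data) <- names(max_place)

prop_data
mean(prop_data)
```

With the grouped tibble from the code above, the same summary could be obtained by ungrouping and taking the mean of the prop_data column.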
We first explored the distribution of each feature by the final finish percentile. Individuals were first classified into 0-19th, 20th-39th, 40th-59th, 60th-79th, 80th-99th, and 100th (winners) percentile finish across all games. Then, we plotted the density of features by these percentiles.
It is important to note that the density plots aggregate by percent finish in a game. Thus, it is possible for one individual to place within the 10th percentile in one game, but then finish in the 90th percentile in another. This individual would contribute to the approximated density for both the 0-19th percentile and the 80th-99th percentile.
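The percentile classes described above are produced in the plotting code by the expression floor(win_place_perc / 2 * 10) * 20, which maps the 0-1 outcome to the lower bound of its 20-point band. A quick check on a few made-up values (hypothetical, chosen to fall in each band):

```r
# Hypothetical finish percentiles on the 0-1 scale of win_place_perc
win_place_perc <- c(0, 0.1, 0.25, 0.5, 0.7, 0.95, 1)

# Same expression as in the plotting code: dividing by 2 and multiplying
# by 10 is a multiplication by 5, so floor() yields 0-5 and the result
# is one of 0, 20, 40, 60, 80, 100
win_place_cat <- floor(win_place_perc / 2 * 10) * 20

data.frame(win_place_perc, win_place_cat)
```

Only players with win_place_perc equal to 1 land in the 100 class, which is why that class corresponds to winners.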
train %>% mutate(win_place_cat = as.factor(floor(win_place_perc / 2 * 10) * 20)) %>%
gather("feature", "value", -match_id, -match_duration,
-id, -win_place_perc, -win_place_cat) %>%
ggplot(aes(x = value, group = win_place_cat, color = win_place_cat)) +
facet_wrap(feature ~., scales = "free") +
geom_density() +
labs(title = "Distribution of Features by Finish Percentile",
x = "Value of Features", y = "Density", color = "Percentile") +
scale_color_hue(labels = c("0-19", "20-39", "40-59", "60-79", "80-99", "100")) +
theme_bw()
This plot has some very interesting features:
The boosts and heals distributions suggest that players who use more boosts or healing items are likely to last longer in the game. This makes intuitive sense, as boosts give players increased passive health regeneration and movement speed, and healing items restore health.

For damage_dealt, we see similar differences among players by their finish percentile. Winners (i.e., the 100th percentile) have a broad distribution of damage dealt, suggesting that some solo players win without dealing much damage while others deal significantly more. It is important to note that winners must have killed at least one individual. Thus, it is expected that the damage dealt distribution for winners is shifted to the right in comparison to players with lower finish percentiles.

kill_place, kill_points, kills, and win_points follow bimodal distributions. This may reflect the play styles of the players. Players who land in populated areas are more likely to encounter other players, resulting in a higher probability of dying or, if the player survives, a larger number of kills. Thus, we can partition players with a 10th percentile finish into two categories: skilled players who die early because they dropped in a populated location but still acquire a large number of kills, and less-skilled players who die early due to lack of skill despite dropping in a less populated location.

For some features (longest_kill, ride_distance, swim_distance, etc.), we may want to log-transform the variables in our model building.

The num_groups density plots suggest that in games where we have little data, we tend to have data on the winners. Thus, there may be some imbalance in the data that we will need to adjust for to ensure that our model does not overestimate finish percentile.

rank_points, win_points, and kill_points are external characteristics (from previous games) that attempt to characterize the skill level of a player. These distributions are bimodal, which may reflect the extremes of the two play styles described above. It seems that kill_points is more predictive of finish percentile than rank_points, as its right shift across finish percentile categories is more distinct. Interestingly, rank_points suggests that prior-game ranks do not have a large impact on the final placement in a game (though there is a note in the pubg_codebook.csv file that this metric is deprecated). This makes sense, since in-game variables like drop location, loot, and circle movement can affect how likely an individual is to win.

Statistics related to kills seem to be well correlated with finish percentile. Additionally, the duration of the game does not seem to be strongly correlated with many of the in-game features such as kills, walk_distance, etc.
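# Compute correlations on the training set, consistent with the rest of the exploration.
# Pairwise deletion keeps the columns that were recoded to NA during cleaning.
corr_matrix = train %>% select(-id, -match_id) %>% cor(use = "pairwise.complete.obs")
corrplot(corr_matrix, method = "circle")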
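Because the cleaning step recodes some rating values to NA, cor() with its default settings returns NA for every pair that involves such a column, which leaves empty cells in the correlation plot. A minimal sketch on made-up numbers (the toy columns and values are hypothetical) of how the use argument changes this:

```r
# Hypothetical toy data with missing rating values
toy <- data.frame(
  kills       = c(0, 2, 5, 1, 3),
  kill_points = c(NA, 1100, 1400, NA, 1250),
  walk_dist   = c(50, 900, 2500, 300, 1800)
)

# Default (use = "everything"): any NA in a column propagates to its correlations
cor_default <- cor(toy)

# Pairwise deletion: each correlation uses the rows where both columns are observed
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

cor_default
cor_pairwise
```

Pairwise deletion means different cells of the matrix can be based on different subsets of rows, so it is best suited to exploratory plots like this one rather than to formal inference.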